
Gemma3: model.safetensors.index.json differs from the original after finetuning, so the model cannot be served with vLLM #8243


Closed
1 task done
junleiz opened this issue May 31, 2025 · 8 comments
Labels
solved This problem has been already solved

Comments


junleiz commented May 31, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

llamafactory version: 0.9.3.dev0
Platform: Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.28
Python version: 3.11.0
PyTorch version: 2.6.0+cu124 (GPU)
Transformers version: 4.52.3
Datasets version: 3.6.0
Accelerate version: 1.7.0
PEFT version: 0.15.2
TRL version: 0.9.6
GPU type: NVIDIA A100-SXM4-80GB
GPU number: 8
GPU memory: 79.14GB
vLLM version: 0.8.5.post1
Git commit: 2c464f329dcd798a0b6b7aaed4719b67dec0c099
Default data directory: not detected

Reproduction

```yaml
### model
model_name_or_path: /storage/home/westlakeLab/zhangjunlei/models/google/gemma-3-12b-it

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true  # choices: [true, false]
freeze_multi_modal_projector: true  # choices: [true, false]
freeze_language_model: false  # choices: [true, false]
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: phone_web_0131_fix_merge_1500_wait_scroll_fix_hover
template: gemma3
cutoff_len: 8192
max_samples: 1000000000
overwrite_cache: false
preprocessing_num_workers: 256
dataset_dir: /backup/lanzhenzhongLab/junleizhang/dataset

### output
output_dir: /backup/lanzhenzhongLab/junleizhang/output/gemma3_phone_web_0131_fix_merge_1500_wait_scroll_fix_hover
logging_steps: 10
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
save_total_limit: 1

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 2.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000
image_max_pixels: 1048576
report_to: wandb
mix_strategy: concat
use_fast_tokenizer: true
disable_shuffling: true
```
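A config like this is launched with the LLaMA-Factory CLI; the config filename below is only an assumed example:

```shell
llamafactory-cli train examples/train_full/gemma3_full_sft.yaml
```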

After finetuning the loss looks normal, but the saved model's model.safetensors.index.json differs from the original model's, which causes the error: there is no module or parameter named 'lm_head' in Gemma3ForConditionalGeneration

I checked, and there is indeed an extra lm_head entry. The original model's model.safetensors.index.json:

Image

After finetuning:

Image

Others

No response

junleiz added the bug and pending labels on May 31, 2025
Kuangdd01 (Collaborator) commented May 31, 2025

The cause is here: Gemma models tie lm_head to the input embeddings, and HF copies the embedding into a separate lm_head tensor, which then gets saved with the checkpoint:
https://github.com/huggingface/transformers/blob/51d732709e5ae424e8fb6c4e58b72057a3e413c2/src/transformers/models/gemma3/modeling_gemma3.py#L806-L824

A workable fix:
https://github.com/vllm-project/vllm/blob/7782464a1714f6081ca06f47b75e824b14316c72/vllm/model_executor/models/gemma3_mm.py#L696-L699
Skip loading the lm_head key there.
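For illustration only (this is not the actual vLLM code; the function below is a hypothetical sketch of the idea), skipping the duplicated key during weight loading would look roughly like this:

```python
# Hypothetical sketch: filter out the duplicated lm_head tensor before the
# loader consumes the checkpoint, so the tied embedding is reused instead.
def skip_tied_lm_head(weights):
    for name, tensor in weights:
        if name.startswith("lm_head"):
            continue  # weight is tied to the input embedding; drop the saved copy
        yield name, tensor
```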

junleiz (Author) commented May 31, 2025

Thank you very much for the reply.

Is it possible to avoid saving lm_head during training?

I'm not very familiar with vLLM. If vLLM has to be modified, is this the line to change? https://github.com/vllm-project/vllm/blob/7782464a1714f6081ca06f47b75e824b14316c72/vllm/model_executor/models/utils.py#L274

That is, skip the weight if its name is "lm_head"?

junleiz (Author) commented May 31, 2025

Would deleting that key directly from model.safetensors.index.json have the same effect?

Kuangdd01 (Collaborator) commented:

I don't think that would work; the weights are loaded according to the keys actually stored in the safetensors shards, not just the index.
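If you prefer to fix the checkpoint itself rather than patch vLLM, a minimal sketch would be to drop the tensor from the shard and the index together. The checkpoint path here is an assumption, and this is untested against vLLM, so back up the files first:

```python
import json
from pathlib import Path

from safetensors.torch import load_file, save_file

ckpt_dir = Path("/disk2/output/gemma-3-12b-it_sft")  # hypothetical checkpoint dir
index_path = ckpt_dir / "model.safetensors.index.json"
index = json.loads(index_path.read_text())

# Remove the duplicated lm_head entry from the index and from its shard.
shard_name = index["weight_map"].pop("lm_head.weight", None)
if shard_name is not None:
    shard_path = ckpt_dir / shard_name
    tensors = load_file(shard_path)
    tensors.pop("lm_head.weight", None)  # drop the saved copy of the tied weight
    save_file(tensors, shard_path, metadata={"format": "pt"})
    index_path.write_text(json.dumps(index, indent=2))
```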

junleiz (Author) commented Jun 1, 2025

Could you provide a conversion script? I'm not very familiar with vLLM and I'm afraid I might change it incorrectly. For now I'm trying to deploy with this config

```yaml
model_name_or_path: /disk2/output/gemma-3-12b-it_sft
template: gemma3
infer_backend: huggingface  # choices: [huggingface, vllm, sglang]
trust_remote_code: true
```

```shell
CUDA_VISIBLE_DEVICES=5,6 API_PORT=5002 llamafactory-cli api examples/inference/gemma3.yaml
```
but it fails with:
```text
return func(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhangjunlei/anaconda3/envs/lf/lib/python3.11/site-packages/transformers/generation/utils.py", line 2597, in generate
  result = self._sample(
           ^^^^^^^^^^^^^
File "/data/users/zhangjunlei/anaconda3/envs/lf/lib/python3.11/site-packages/transformers/generation/utils.py", line 3560, in _sample
  outputs = model_forward(**model_inputs, return_dict=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhangjunlei/anaconda3/envs/lf/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 574, in _fn
  return fn(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^
File "/data/users/zhangjunlei/anaconda3/envs/lf/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhangjunlei/anaconda3/envs/lf/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
  return forward_call(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhangjunlei/anaconda3/envs/lf/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1380, in __call__
  return self._torchdynamo_orig_callable(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhangjunlei/anaconda3/envs/lf/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 547, in __call__
  return _compile(
         ^^^^^^^^^
File "/data/users/zhangjunlei/anaconda3/envs/lf/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 925, in _compile
  raise RecompileLimitExceeded(f"{limit_type} reached")
torch._dynamo.exc.RecompileLimitExceeded: cache_size_limit reached
```

Kuangdd01 (Collaborator) commented:

You can upgrade transformers to 4.52.4; versions 4.52.1 through 4.52.3 all have some bugs.
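The upgrade itself is just a standard pip pin:

```shell
pip install -U "transformers==4.52.4"
```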

junleiz (Author) commented Jun 2, 2025

Do I need to retrain?

Kuangdd01 (Collaborator) commented:

> Do I need to retrain?

Yes, you do.

hiyouga added the solved label and removed the bug and pending labels on Jun 3, 2025
hiyouga closed this as completed on Jun 3, 2025